[Attention][UX][1/N] Add AttentionConfig and change attention env var… #30

Open

MitchLewis930 wants to merge 1 commit into config_warning_before from config_warning_after

Conversation


@MitchLewis930 MitchLewis930 commented Jan 24, 2026

Test

Summary by CodeRabbit

Release Notes

  • Refactor

    • Migrated attention backend configuration from environment variables to a structured configuration system, enabling standardized backend selection via CLI flags or config files.
    • Removed legacy context-manager-based backend forcing mechanisms in favor of declarative configuration.
  • Tests

    • Updated test infrastructure to use new configuration-based attention backend selection.


…s to CLI arguments (vllm-project#26315)

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>

coderabbitai bot commented Jan 24, 2026

📝 Walkthrough

Walkthrough

A comprehensive refactoring replaces environment-variable and context-manager-based attention backend configuration with a unified AttentionConfig class integrated into VllmConfig. Backend selection, flash attention versioning, quantization toggles, and optimization flags now derive from configuration layers throughout the attention system and engine setup.

Changes

Cohort / File(s) Summary
Configuration Infrastructure
vllm/config/attention.py, vllm/config/__init__.py, vllm/config/vllm.py
New AttentionConfig dataclass centralizes attention settings (backend selection, flash-attn versioning, prefill/decode kernels, CUDA graph splits, cudnn, TRTLLM, FP8 quantization). Includes environment-driven overrides with deprecation warnings, backend parsing, and compute_hash() method. Integrated into VllmConfig as a new field with default factory.
Backend Selector & Selection Logic
vllm/attention/selector.py, vllm/attention/utils/fa_utils.py
Replaced environment variable handling (VLLM_ATTENTION_BACKEND, VLLM_FLASH_ATTN_VERSION) and global-state context managers with config-driven backend selection. Removed legacy helpers (get_env_variable_attn_backend, global_force_attn_backend, global_force_attn_backend_context_manager). Backend now derived from vllm_config.attention_config.backend and passed explicitly through function chains.
Abstract & Base Attention Layer
vllm/attention/backends/abstract.py, vllm/attention/layer.py
Changed supports_quant_query_input from instance method to boolean attribute in AttentionImpl. Updated AttentionLayer to check attribute instead of calling method.
V0 Attention Backends
vllm/attention/...
(No direct changes to v0 backends beyond abstract class)
V1 Attention Backends
vllm/v1/attention/backends/flash_attn.py, vllm/v1/attention/backends/flashinfer.py, vllm/v1/attention/backends/triton_attn.py, vllm/v1/attention/backends/rocm_attn.py, vllm/v1/attention/backends/mla/...
Replaced environment variable access with config-driven values via get_current_vllm_config(). Flash attention now reads flash_attn_max_num_splits_for_cuda_graph from config. FlashInfer backends read disable_flashinfer_q_quantization from config. MLA backends query config for prefill and CUDA graph settings. Converted supports_quant_query_input() methods to instance attributes initialized in __init__.
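The get_current_vllm_config() pattern the backends rely on can be sketched roughly as follows (a simplified stand-in using a plain dict for the config; per the review notes below, the real helper logs a warning and returns a default config when none is active):

```python
import contextlib

_current_config = None  # module-level "current config" slot


@contextlib.contextmanager
def set_current_vllm_config(cfg):
    # Scoped setter: backends constructed inside the block see `cfg`.
    global _current_config
    prev, _current_config = _current_config, cfg
    try:
        yield
    finally:
        _current_config = prev


def get_current_vllm_config():
    # Backends call this in __init__ instead of reading os.environ.
    # Fall back to a default when no config is active (the real helper
    # also logs a warning in that case).
    if _current_config is None:
        return {"attention_config": {"backend": None}}
    return _current_config
```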
Engine Configuration
vllm/engine/arg_utils.py
Added AttentionConfig and attention_backend fields to EngineArgs. Introduced CLI argument --attention-backend tied to AttentionConfig.backend. Engine assembly now passes attention_config to VllmConfig. Added dict-to-object conversion for attention_config in __post_init__.
Model Executor Configuration & Vision
vllm/model_executor/models/config.py, vllm/model_executor/models/vision.py, vllm/config/model.py
Replaced environment variable backend checks (VLLM_ATTENTION_BACKEND) with enum-based checks using AttentionBackendEnum and attention_config.backend. Removed flashinfer dependency check from ModelConfig.__post_init__. Vision transformer backend selection now reads from global config instead of env var.
Platform-Specific Logic
vllm/platforms/cuda.py
Replaced VLLM_ATTENTION_BACKEND environment variable with vllm_config.attention_config.backend for CUDA platform backend selection. When backend is unset, defaults are applied (CUTLASS_MLA for Blackwell, FlashMLA otherwise) and stored back in config.
Utility Functions
vllm/utils/flashinfer.py
Rewrote force_use_trtllm_attention() to read from config (vllm_config.attention_config.use_trtllm_attention) instead of environment. Removed _force_use_trtllm_attention() helper and flashinfer_disable_q_quantization export. Updated user-facing messages to reference CLI flag --attention-config.use_trtllm_attention.
Test Updates
tests/compile/test_fusion_attn.py, tests/v1/worker/test_gpu_model_runner.py, tests/v1/kv_connector/unit/test_nixl_connector.py
Migrated tests to use AttentionConfig(backend=...) in VllmConfig instead of context-manager-driven backend forcing. Added dynamic branching in KV cache tests to handle varying layout shapes. Updated test scaffolding to rely on config-driven backend selection.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes


🐰 Hops through configs with glee,
No more env vars, oh what spree!
Backend selection, now so clean,
Config-driven, crisp, serene!
Context managers fade away,
AttentionConfig rules the day! 🎉

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 54.55%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The PR title accurately describes the main change: introducing AttentionConfig and transitioning from environment variables to configuration-based attention backend selection, the primary focus across multiple modified files. |




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
vllm/platforms/cuda.py (1)

175-211: Default FlashMLA intent doesn’t match auto-selection.

When attention_config.backend is None on non‑Blackwell MLA, the code sets use_flashmla = True but leaves the backend unset. Auto-selection still prioritizes FLASH_ATTN_MLA, while block-size forcing assumes FlashMLA. Consider either setting the backend when FlashMLA is actually intended or adjusting the comment/logic to reflect the real default.

✅ Suggested fix (set backend only when FlashMLA is supported)
```diff
-            if (
-                use_flashmla
-                and is_flashmla_dense_supported()[0]
-                and cache_config.block_size % 64 != 0
-            ):
-                cache_config.block_size = 64
-                logger.info("Forcing kv cache block size to 64 for FlashMLA backend.")
+            if use_flashmla and is_flashmla_dense_supported()[0]:
+                if vllm_config.attention_config.backend is None:
+                    vllm_config.attention_config.backend = (
+                        AttentionBackendEnum.FLASHMLA
+                    )
+                if cache_config.block_size % 64 != 0:
+                    cache_config.block_size = 64
+                    logger.info(
+                        "Forcing kv cache block size to 64 for FlashMLA backend."
+                    )
```
♻️ Duplicate comments (1)
vllm/attention/layer.py (1)

339-342: Same concern as above — attribute vs method.

🧹 Nitpick comments (4)
tests/v1/kv_connector/unit/test_nixl_connector.py (1)

1154-1176: Consider adding a brief comment explaining the layout detection strategy.

The logic correctly identifies the KV cache layout by probing with num_blocks=1. However, the reasoning behind test_shape[0] == 1 determining is_blocks_first is non-obvious. A comment would help future maintainers understand why this works:

  • FlashAttn shape: (2, num_blocks, ...) → when num_blocks=1, shape[0]=2
  • Triton shape: (num_blocks, 2, ...) → when num_blocks=1, shape[0]=1
📝 Suggested comment
```diff
     # Store tensor info for validation
     test_shape = backend_cls.get_kv_cache_shape(
         num_blocks=1, block_size=16, num_kv_heads=1, head_size=1
     )
+    # Detect if num_blocks is the first dimension by probing with num_blocks=1.
+    # FlashAttn: (2, num_blocks, ...) → shape[0]=2; Triton: (num_blocks, 2, ...) → shape[0]=1
     is_blocks_first = len(test_shape) == 5 and test_shape[0] == 1
```
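As a standalone illustration of why the probe works (the two shape functions below are toy stand-ins for the backends' get_kv_cache_shape, not the real implementations):

```python
def flashattn_shape(num_blocks, block_size, num_kv_heads, head_size):
    # K/V-first layout: (2, num_blocks, block_size, num_kv_heads, head_size)
    return (2, num_blocks, block_size, num_kv_heads, head_size)


def triton_shape(num_blocks, block_size, num_kv_heads, head_size):
    # Blocks-first layout: (num_blocks, 2, block_size, num_kv_heads, head_size)
    return (num_blocks, 2, block_size, num_kv_heads, head_size)


def is_blocks_first(get_kv_cache_shape) -> bool:
    # Probe with num_blocks=1: a blocks-first layout puts 1 in shape[0],
    # while a K/V-first layout puts 2 there.
    shape = get_kv_cache_shape(
        num_blocks=1, block_size=16, num_kv_heads=1, head_size=1
    )
    return len(shape) == 5 and shape[0] == 1
```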
vllm/v1/attention/backends/mla/flashattn_mla.py (1)

124-135: Validate flash_attn_max_num_splits_for_cuda_graph is positive. A zero/negative value could lead to invalid scheduling or buffer sizing during CUDA graph capture. Consider validating before assignment.

♻️ Proposed fix
```diff
         if self.use_full_cuda_graph and self.fa_aot_schedule:
             self.scheduler_metadata = torch.zeros(
                 vllm_config.scheduler_config.max_num_seqs + 1,
                 dtype=torch.int32,
                 device=self.device,
             )
             # When using cuda graph, we need to set the upper bound of the
             # number of splits so that large enough intermediate buffers are
             # pre-allocated during capture.
-            self.max_num_splits = (
-                vllm_config.attention_config.flash_attn_max_num_splits_for_cuda_graph
-            )
+            max_splits = (
+                vllm_config.attention_config.flash_attn_max_num_splits_for_cuda_graph
+            )
+            if max_splits <= 0:
+                raise ValueError(
+                    "flash_attn_max_num_splits_for_cuda_graph must be > 0"
+                )
+            self.max_num_splits = max_splits
```
tests/v1/worker/test_gpu_model_runner.py (1)

938-956: Remove duplicate Mamba KV-cache verification loop.

The second loop repeats the exact same assertions and can be dropped to reduce test time/noise.

♻️ Suggested cleanup
```diff
-    for layer in [layer_2, layer_3, layer_4, layer_5]:
-        for i, kv_block in enumerate(kv_blocks_for_mamba):
-            actual_conv = vllm_ctx[layer].kv_cache[0][0][kv_block, :]
-            actual_ssm = vllm_ctx[layer].kv_cache[0][1][kv_block, :]
-            expected_conv = conv_blocks_constant[i]
-            expected_ssm = ssm_blocks_constant[i]
-
-            assert torch.equal(actual_conv, expected_conv)
-            assert torch.equal(actual_ssm, expected_ssm)
```
vllm/attention/selector.py (1)

60-70: Consider adding a type annotation for the backend parameter.

The backend parameter lacks a type annotation, while other parameters are properly annotated. This reduces code clarity and IDE support.

Suggested fix
```diff
 @cache
 def _cached_get_attn_backend(
-    backend,
+    backend: str | None,
     head_size: int,
     dtype: torch.dtype,
```

Note: Adjust the type based on the actual type of attention_config.backend (likely an enum or string).

📜 Review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dff0a2b and 66e674c.

📒 Files selected for processing (22)
  • tests/compile/test_fusion_attn.py
  • tests/v1/kv_connector/unit/test_nixl_connector.py
  • tests/v1/worker/test_gpu_model_runner.py
  • vllm/attention/backends/abstract.py
  • vllm/attention/layer.py
  • vllm/attention/selector.py
  • vllm/attention/utils/fa_utils.py
  • vllm/config/__init__.py
  • vllm/config/attention.py
  • vllm/config/model.py
  • vllm/config/vllm.py
  • vllm/engine/arg_utils.py
  • vllm/model_executor/models/config.py
  • vllm/model_executor/models/vision.py
  • vllm/platforms/cuda.py
  • vllm/utils/flashinfer.py
  • vllm/v1/attention/backends/flash_attn.py
  • vllm/v1/attention/backends/flashinfer.py
  • vllm/v1/attention/backends/mla/common.py
  • vllm/v1/attention/backends/mla/flashattn_mla.py
  • vllm/v1/attention/backends/rocm_attn.py
  • vllm/v1/attention/backends/triton_attn.py
💤 Files with no reviewable changes (1)
  • vllm/config/model.py
🧰 Additional context used
🧬 Code graph analysis (16)
vllm/config/__init__.py (1)
vllm/config/attention.py (1)
  • AttentionConfig (18-114)
vllm/attention/utils/fa_utils.py (2)
vllm/config/utils.py (1)
  • config (35-48)
vllm/config/vllm.py (1)
  • get_current_vllm_config (1310-1317)
vllm/v1/attention/backends/mla/common.py (1)
vllm/config/vllm.py (1)
  • get_current_vllm_config (1310-1317)
vllm/config/vllm.py (1)
vllm/config/attention.py (1)
  • AttentionConfig (18-114)
vllm/model_executor/models/vision.py (1)
vllm/config/vllm.py (2)
  • VllmConfig (175-1247)
  • get_current_vllm_config (1310-1317)
vllm/engine/arg_utils.py (4)
vllm/config/attention.py (1)
  • AttentionConfig (18-114)
vllm/config/utils.py (1)
  • get_field (51-73)
vllm/config/vllm.py (1)
  • VllmConfig (175-1247)
vllm/attention/backends/registry.py (1)
  • AttentionBackendEnum (34-123)
vllm/v1/attention/backends/triton_attn.py (1)
vllm/platforms/interface.py (1)
  • is_cuda (150-151)
vllm/platforms/cuda.py (2)
vllm/attention/backends/registry.py (1)
  • AttentionBackendEnum (34-123)
vllm/platforms/interface.py (1)
  • is_device_capability (280-301)
tests/v1/worker/test_gpu_model_runner.py (3)
vllm/attention/backends/registry.py (1)
  • AttentionBackendEnum (34-123)
vllm/config/attention.py (1)
  • AttentionConfig (18-114)
vllm/config/vllm.py (1)
  • VllmConfig (175-1247)
vllm/v1/attention/backends/flashinfer.py (2)
vllm/config/vllm.py (2)
  • VllmConfig (175-1247)
  • get_current_vllm_config (1310-1317)
vllm/utils/flashinfer.py (1)
  • can_use_trtllm_attention (282-287)
vllm/config/attention.py (3)
vllm/attention/backends/registry.py (1)
  • AttentionBackendEnum (34-123)
vllm/config/utils.py (3)
  • config (35-48)
  • get_hash_factors (276-294)
  • hash_factors (297-299)
vllm/logger.py (2)
  • init_logger (206-216)
  • warning_once (138-147)
tests/v1/kv_connector/unit/test_nixl_connector.py (7)
vllm/attention/backends/abstract.py (1)
  • get_kv_cache_shape (71-78)
vllm/v1/attention/backends/flash_attn.py (1)
  • get_kv_cache_shape (102-111)
vllm/v1/attention/backends/flashinfer.py (1)
  • get_kv_cache_shape (306-313)
vllm/v1/attention/backends/mla/common.py (1)
  • get_kv_cache_shape (302-309)
vllm/v1/attention/backends/rocm_attn.py (1)
  • get_kv_cache_shape (181-190)
vllm/v1/attention/backends/triton_attn.py (1)
  • get_kv_cache_shape (177-186)
vllm/distributed/kv_transfer/kv_connector/utils.py (1)
  • block_size (233-234)
vllm/model_executor/models/config.py (1)
vllm/attention/backends/registry.py (1)
  • AttentionBackendEnum (34-123)
vllm/attention/selector.py (1)
vllm/config/vllm.py (1)
  • get_current_vllm_config (1310-1317)
tests/compile/test_fusion_attn.py (1)
vllm/config/attention.py (1)
  • AttentionConfig (18-114)
vllm/utils/flashinfer.py (3)
vllm/config/utils.py (1)
  • config (35-48)
vllm/config/vllm.py (1)
  • get_current_vllm_config (1310-1317)
vllm/logger.py (1)
  • info_once (129-136)
🪛 Ruff (0.14.13)
vllm/engine/arg_utils.py

531-531: Do not perform function call get_field in dataclass defaults

(RUF009)


1718-1721: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (47)
tests/v1/kv_connector/unit/test_nixl_connector.py (2)

1233-1237: LGTM!

The block length calculation correctly accounts for the different tensor layouts. For Triton (blocks-first), the full tensor size needs division by both num_blocks and 2 (K/V split). For FlashAttn, expected_tensor_size is already halved, so only division by num_blocks is needed. Both approaches correctly yield the per-block-per-KV size.
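The arithmetic can be checked with a toy helper (per_block_len is hypothetical, not code from the PR):

```python
def per_block_len(total_elems: int, num_blocks: int, blocks_first: bool) -> int:
    # Triton (blocks-first) registers one tensor covering both K and V, so
    # divide by num_blocks and by 2; for FlashAttn the expected size is
    # already per-K/V, so only divide by num_blocks.
    return total_elems // num_blocks // (2 if blocks_first else 1)
```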


1211-1211: LGTM!

The assertion correctly uses the dynamically computed expected_num_entries instead of a hardcoded value, allowing the test to validate both FlashAttn (4 entries) and Triton (2 entries) cache registration patterns.

vllm/v1/attention/backends/rocm_attn.py (1)

162-169: Clearer CLI guidance for unsupported head sizes. Looks good.

vllm/config/__init__.py (2)

4-5: Import of AttentionConfig is appropriate.


49-52: __all__ exposure for AttentionConfig looks correct.

vllm/attention/utils/fa_utils.py (1)

51-56: Config-driven FA version override looks good.

vllm/model_executor/models/vision.py (2)

14-15: Config import is appropriate here.


94-96: Config-based backend selection is consistent with the refactor.

vllm/attention/layer.py (1)

303-307: The concern raised in this review is unfounded. The base class AttentionImpl defines supports_quant_query_input as a bool class attribute with a default value of False (vllm/attention/backends/abstract.py:300). All subclasses that override it do so correctly with boolean attribute assignments, never method definitions. Therefore, attribute access at lines 306 and 341 will always return a bool value, and the check is safe.

Likely an incorrect or invalid review comment.

vllm/attention/backends/abstract.py (1)

292-301: LGTM: capability flag is clear and consistent.

The attribute-based capability flag aligns well with the config-driven checks and avoids per-call method overhead.

vllm/v1/attention/backends/triton_attn.py (1)

262-263: LGTM: instance flag preserves prior behavior.

Setting the flag in __init__ keeps parity with the previous method-based check.

vllm/config/vllm.py (3)

30-31: LGTM: AttentionConfig import is in place.

Clean wiring for the new config type.


196-197: LGTM: attention_config default factory is appropriate.

Ensures the config is always present without requiring callers to pass it.


285-288: LGTM: hashing includes attention_config.

Good to include attention settings in the computation-graph hash factors.

vllm/model_executor/models/config.py (2)

7-8: LGTM: enum-based backend selection.

Improves type safety over string/env checks.


334-369: LGTM: config-driven backend checks read cleanly.

Using attention_config.backend and allowing None keeps the logic consistent with the new config flow.

tests/compile/test_fusion_attn.py (2)

20-22: LGTM: test imports align with new public API.

Clear and direct use of AttentionConfig.


338-339: LGTM: explicit attention_config improves test determinism.

Backend is now controlled via config instead of external state.

vllm/v1/attention/backends/mla/common.py (3)

438-450: Config-driven FlashInfer prefill gating looks good.

The new AttentionConfig-based checks keep the existing selection logic while centralizing the flags.


453-462: CUDNN prefill selection remains consistent under config control.

Reads from AttentionConfig as expected and preserves the existing capability checks.


465-474: TRTLLM ragged DeepSeek prefill gating is aligned with new config flow.

Looks consistent with the updated configuration source.

vllm/engine/arg_utils.py (8)

36-38: Import wiring for AttentionConfig is straightforward.

No concerns with the added config import.


547-548: Top-level attention_backend override is cleanly surfaced.

The extra field provides a simple CLI path while still deferring to AttentionConfig merging later.


584-587: Dict-to-AttentionConfig coercion is a nice usability touch.

This mirrors other config coercions and keeps EngineArgs flexible.


725-734: New AttentionConfig CLI group is nicely scoped.

Adds the expected --attention-backend surface without polluting other groups.


1158-1160: Passing --attention-config through VllmConfig is consistent with other config knobs.

The dataclass JSON route looks aligned with existing patterns.


1714-1723: Mutual-exclusion logic for attention_backend vs attention_config.backend is solid.

This prevents ambiguous overrides and keeps config resolution deterministic.


1781-1783: AttentionConfig is now threaded into VllmConfig as intended.

Good to see it wired into the final config assembly.


531-531: Do not apply the suggested fix; it breaks the semantic intent of the code.

While RUF009 is correctly triggered, the suggested fix of using field(default_factory=AttentionConfig) is semantically incorrect. The get_field() function (from vllm.config.utils) does not simply create a field with the type as a factory—it extracts the actual default_factory from the source dataclass field. Using field(default_factory=AttentionConfig) would call AttentionConfig() directly rather than using the configured default factory from VllmConfig.attention_config, changing the runtime behavior.

The correct approach is to suppress the RUF009 rule for this line (e.g., # noqa: RUF009) or refactor the code to avoid the function call in the default, rather than replacing get_field() with an incompatible alternative.

Likely an incorrect or invalid review comment.

vllm/utils/flashinfer.py (2)

270-279: Config-backed TRTLLM override hook looks good.

Returning the AttentionConfig flag keeps the override source centralized.


305-369: Updated TRTLLM selection messaging and gating are consistent.

The explicit enable/disable paths now reference the new CLI/config flag appropriately.

tests/v1/worker/test_gpu_model_runner.py (3)

8-13: Imports updated for config-driven backend selection.

No issues with the new test dependencies.


811-818: Test now uses AttentionConfig to select FLASHINFER backend.

This aligns with the new config-driven attention selection path.


827-857: Scoped set_current_vllm_config usage looks correct.

Keeps the static forward context and runner initialization consistent with the active config.

vllm/config/attention.py (4)

16-49: AttentionConfig fields cover the needed backend and kernel knobs.

Good consolidation of attention-related flags into a single config object.


51-63: compute_hash integration looks appropriate for graph-affecting knobs.

This should keep caching keyed to attention configuration changes.


65-71: String-to-enum backend parsing is a useful usability improvement.

Nice touch for CLI/config inputs.


73-114: Env-var migration warnings and overrides are handled cleanly.

Centralizing the deprecation path here keeps callers simpler.

vllm/v1/attention/backends/flashinfer.py (4)

29-29: LGTM!

Import addition is consistent with the codebase pattern for accessing vLLM configuration.


364-366: LGTM!

Comment accurately reflects the config-driven approach for disabling TRTLLM attention.


503-513: LGTM!

The config-driven q_data_type selection logic is correct. The condition properly checks both TRTLLM availability and the quantization disable flag before selecting the appropriate data type.


1041-1045: LGTM!

The attribute initialization correctly combines TRTLLM support check with the config-driven quantization disable flag. The use of get_current_vllm_config() follows the established pattern in this codebase.

Note: get_current_vllm_config() may log a warning and return a default config during CI/testing when the config is not explicitly set, but this is the expected behavior per the function's design.

vllm/v1/attention/backends/flash_attn.py (3)

266-266: LGTM!

Storing attention_config reference follows the same pattern as other config fields in this builder (e.g., model_config, cache_config).


307-309: LGTM!

Config-driven access for flash_attn_max_num_splits_for_cuda_graph replaces the environment variable correctly.


559-559: LGTM!

FlashAttention unconditionally supports quantized query input, which differs from FlashInfer's conditional support. Converting from a method to an attribute is consistent with the refactoring across attention backends.

vllm/attention/selector.py (2)

41-56: LGTM!

The refactoring correctly retrieves the attention backend from vllm_config.attention_config.backend and passes it to the cached function. The import is placed inside the function to avoid circular imports, which is a common pattern in this codebase.


80-103: LGTM!

The backend parameter is correctly passed to get_attn_backend_cls in both code paths (with and without the deprecated use_v1 parameter).

